Creators/Authors contains: "LeFevre, Jeff"

  1. With ever-increasing dataset sizes, several file formats such as Parquet, ORC, and Avro have been developed to store data efficiently and save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck, struggling to keep up with feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires an extensive understanding of the system internals. Previous approaches re-implemented functionality of data processing frameworks and access libraries for a particular storage system, a duplication of effort that might have to be repeated for different storage systems. This paper introduces a new design paradigm that allows extending programmable object storage systems to embed existing, widely used data processing frameworks and access libraries into the storage layer with no modifications. In this approach, data processing frameworks and access libraries can evolve independently from storage systems while leveraging distributed storage systems' scale-out and availability properties. We present Skyhook, an example implementation of our design paradigm using Ceph, Apache Arrow, and Parquet. We provide a brief performance evaluation of Skyhook and discuss key results.
  2. The incubator and research projects sponsored by the Center for Research in Open Source Software (CROSS, cross.ucsc.edu) at UC Santa Cruz have been very effective at promoting the professional and technical development of research software engineers. Carlos Maltzahn founded CROSS in 2015 with a generous gift of $2,000,000 from UC Santa Cruz alumnus Dr. Sage Weil and founding memberships of Toshiba America Electronic Components, SK Hynix Memory Solutions, and Micron Technology. Over the past five years, CROSS funding has enabled PhD students to not only create research software projects but also learn how to draw in new contributors and leverage established open source software communities. This position paper will present CROSS fellowships as case studies for how university-led open source projects can create a real-world, reproducible model for effectively training, funding, and supporting research software engineers.
  3. The Skyhook Data Management project (SkyhookDM.com) at the Center for Research in Open Source Software (cross.ucsc.edu) at UC Santa Cruz implements customized extensions through Ceph's object class interface that enable offloading database operations to the storage system. In our previous Vault '19 talk, we showed how SkyhookDM can transparently scale out databases. The SkyhookDM Ceph extensions are an example of our 'programmable storage' research efforts at UCSC, and can be accessed through commonly available external/foreign table database interfaces. Utilizing fast in-memory serialization libraries such as Google Flatbuffers and Apache Arrow, SkyhookDM currently implements common database functions such as SELECT, PROJECT, AGGREGATE, and indexing inside Ceph, along with lower-level data manipulations such as transforming data from row to column formats on RADOS servers. In this talk, we will present three of our latest developments on the SkyhookDM project since Vault '19. First, SkyhookDM can also be used to offload operations of access libraries that support plugins for backends, such as HDF5 and its Virtual Object Layer. Second, in addition to the row-oriented data format using Google's Flatbuffers, we have added support for column-oriented data formats using the Apache Arrow library within our Ceph extensions. Third, we added dynamic switching between row and column data formats within Ceph objects, a first step towards physical design management in storage systems, similar to physical design tuning in database systems.
  4. In the post-Moore era, systems and devices with new architectures will arrive at a rapid rate with significant impacts on the software stack. Applications will not be able to fully benefit from new architectures unless they can delegate adapting to new devices to lower layers of the stack. In this paper we introduce physical design management, which deals with the problem of identifying and executing transformations on physical designs of stored data, i.e., how data is mapped to storage abstractions like files, objects, or blocks, in order to improve performance. Physical design is traditionally placed with applications, access libraries, and databases, using hard-wired assumptions about underlying storage systems. Yet, storage systems increasingly not only contain multiple kinds of storage devices with vastly different performance profiles but also move data among those storage devices, thereby changing the benefit of a particular physical design. We advocate placing physical design management in storage, identify interesting research challenges, provide a brief description of a prototype implementation in Ceph, and discuss the results of initial experiments at scale that are replicable using Cloudlab. These experiments show performance and resource utilization trade-offs associated with choosing different physical designs and choosing to transform between physical designs.
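
As a rough illustration of two ideas that recur in entries 3 and 4 (transforming data between row and column layouts, and running SELECT and PROJECT next to the data so that only the reduced result travels to the client), here is a minimal sketch in plain Python. It does not use Ceph, Apache Arrow, or any SkyhookDM interface; the names `rows_to_columns`, `columns_to_rows`, and `storage_side_scan` are hypothetical and exist only for this example.

```python
from typing import Any, Callable

Row = dict[str, Any]            # one record: column name -> value
Columns = dict[str, list[Any]]  # column name -> list of values


def rows_to_columns(rows: list[Row]) -> Columns:
    """Transform a row-oriented layout into a column-oriented one.

    Assumes every row has the same set of keys.
    """
    columns: Columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns: Columns) -> list[Row]:
    """Inverse transformation: rebuild rows from equal-length columns."""
    names = list(columns)
    length = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(length)]


def storage_side_scan(
    columns: Columns,
    projection: list[str],
    predicate_column: str,
    predicate: Callable[[Any], bool],
) -> Columns:
    """Hypothetical stand-in for work done inside the storage layer.

    Applies a filter (SELECT) on one column and keeps only the requested
    columns (PROJECT), so the caller receives the reduced data only.
    """
    keep = [i for i, v in enumerate(columns[predicate_column]) if predicate(v)]
    return {name: [columns[name][i] for i in keep] for name in projection}


if __name__ == "__main__":
    rows = [
        {"id": 1, "temp": 20.5, "site": "a"},
        {"id": 2, "temp": 31.0, "site": "b"},
        {"id": 3, "temp": 28.2, "site": "a"},
    ]
    cols = rows_to_columns(rows)
    reduced = storage_side_scan(cols, ["id", "temp"], "temp", lambda t: t > 25.0)
    print(reduced)
```

In this sketch the column layout lets the filter touch only the `temp` values and the projection skip `site` entirely, which is the kind of trade-off between physical designs that the abstracts discuss; a real system would also have to weigh the cost of the transformation itself.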